Data curation: Tidy data & metadata

Simon Podhajsky

July 30, 2017

Tidy data

Slogan

Write documentation for humans. Write data for computers.

What challenges do you face?

Note on file organization

data/raw/
data/clean/
processing/
analysis/
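If the skeleton doesn't exist yet, it can be created from R in a few lines (a minimal sketch; the directory names are the ones above):

```r
# Create the project skeleton; showWarnings = FALSE makes the
# loop safe to re-run when a directory already exists
for (d in c("data/raw", "data/clean", "processing", "analysis")) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}
```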

Note on file formats

Preferred:

  • plain text
  • open (non-proprietary) format (= I don’t need to buy software to view this)

Data is tidy if…

  1. each observation occupies a single row
  2. each variable occupies a single column
  3. each type of observational unit occupies a single table
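To make the three rules concrete, here is a toy sketch (the names `messy` and `tidy` are illustrative, not from the SIPS data): the same scores stored with treatment labels in the column headers, then reshaped so each observation gets its own row.

```r
library(tidyr)

# Messy: the treatment labels live in the column headers
messy <- data.frame(
  person = c("Ann", "Bob"),
  drug_a = c(4, 6),
  drug_b = c(7, 9)
)

# Tidy: one row per observation, one column per variable
tidy <- gather(messy, key = "treatment", value = "score", drug_a, drug_b)
```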

Easy?

Yeah, right

  • Is observation…
    • a survey submission?
    • each response in that submission?
  • What separates kinds of observation?
  • What separates observations from variables? (work_phone vs. home_phone)
  • The answer may depend on context and discretion, but broad intuitions still apply

Negative definition

Messy data is usually messy in the following ways:

  • Column headers are values, not variable names
  • Multiple variables are stored in one column
  • Variables are stored in both rows and columns
  • Multiple types of experimental unit are stored in the same table
  • A single type of experimental unit is stored in multiple tables

Origin

Hadley Wickham’s (2014) Tidy Data paper.

Do read it for details and more formal definitions.

Messy dataset example: Your plans for SIPS

  • Remember that form you filled out? We’ll look at the CSV output from Google Forms.
  • (Examples are a little R-heavy, but many tools can do the same)

What are the problems here? (1/2)

knitr::kable(x[106, 1:4], row.names = FALSE)
Timestamp: 7/17/2017 10:45:31
Name: Fred Satterfield
Email address: px@kiwabmx.bzq
Which events do you plan to attend on Sunday, July 30?: [Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Workshop: Fundamentals of meta-analysis (9:30a-12:30p), Workshop: Using OSF (1:30-3:30p), Workshop: Writing papers to be transparent, reproducible, and fabulous (3:30-5:30p), Hack: Resources for changing culture in department, university, society (all day)

What are the problems here? (2/2)

Which events do you plan to attend on Monday, July 31? (not including lightning talks, unconference sessions, and hackathons that will continue or start on 2nd day): Workshop: R Markdown (1:30-3:30p), Workshop: Power analysis and effect size (3:30p-5:30p), All Conference Social Event (Dinner+drinks; 7:30p-??)
Which events do you plan to attend on Tuesday, August 1st?: Workshop: Fundamentals of R (9:30a-12:30p)

Basic issues

  • Unwieldy column names
  • What timezone is the datetime? (More in the Metadata section)
  • Ambiguous use of commas
  • Identifying personal information
  • Brian’s social is listed on the wrong day
  • Mistakes? (Double submission, …)
# Read the file in with better headers
library(readr)
x <- read_csv('data/raw/plans_raw.csv',
              col_names = c("time_submitted", "name", "email", 
                            "day1", "day2", "day3"),
              col_types = list(
                col_datetime(format = '%m/%d/%Y %H:%M:%S'), 
                col_character(), col_character(), 
                col_character(), col_character(),
                col_character()),
              skip = 1) # skip 1 row (headers)
head(x)
## # A tibble: 6 x 6
##        time_submitted             name                    email
##                <dttm>            <chr>                    <chr>
## 1 2017-07-14 11:54:46  Bernadine Ratke              ad@xosz.tdi
## 2 2017-07-14 10:13:10    Juliane Kling     jbacqmnoug@qcsrv.hut
## 3 2017-07-14 10:14:02  Argelia DuBuque        sbinxjw@bprhc.ybo
## 4 2017-07-14 10:16:39 Williams Mueller xrugcmajh@qbaetkvwmj.kwe
## 5 2017-07-14 11:17:02    Davida Ledner              iokv@vl.dyt
## 6 2017-07-16 03:07:14     Ethan Klocko               kl@jbn.vmb
## # ... with 3 more variables: day1 <chr>, day2 <chr>, day3 <chr>
knitr::kable(x[106, 4:6], row.names = FALSE)
day1: [Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Workshop: Fundamentals of meta-analysis (9:30a-12:30p), Workshop: Using OSF (1:30-3:30p), Workshop: Writing papers to be transparent, reproducible, and fabulous (3:30-5:30p), Hack: Resources for changing culture in department, university, society (all day)
day2: Workshop: R Markdown (1:30-3:30p), Workshop: Power analysis and effect size (3:30p-5:30p), All Conference Social Event (Dinner+drinks; 7:30p-??)
day3: Workshop: Fundamentals of R (9:30a-12:30p)

Mess #1: mixing different data sets

  • Our dataset centers on plans for specific events. Why do we have people’s emails here?
  • This becomes especially problematic with the next mess.

(tidy data illustration)

Tidy solution: divide the data units into tables you can re-merge

library(dplyr) # for select() and %>%

people <- select(x, time_submitted, name, email) %>% unique()
people$data_origin <- 'plans_raw.csv' # if many possible sources

# if non-existent, create an ID to join the tables by. 
# (Usually, you might have an ID for each participant at 
# this point, which you'd now deidentify.)
people$participant_ID <- sample(1:nrow(people), nrow(people))

# separate out the dataset, enable later connection with index
x <- merge(x, people) %>% select(-time_submitted, -name, -email)
  • Should we separate the table of events and plans, too?

As a bonus, sensitive info is relegated to a single spot, and is easier to deal with:

# we don't want to share the name info, but we can extract 
# non-identifying metadata of interest
people <- transmute(people, 
                    time_submitted, participant_ID,
                    email_server = gsub('.+@([^@]+)$', '\\1', email),
                    tld = gsub('.+\\.(\\w{1,7}$)', '\\1', email),
                    # lengths() vectorizes over the list strsplit() returns
                    number_names = lengths(strsplit(name, ' '))
)
time_submitted: 2017-07-14 11:54:46
participant_ID: 8
email_server: xosz.tdi
tld: tdi
number_names: 2

Mess #2: Data stored in column names

  • Each column name stores the date on which the events it lists are held.
  • Even though we renamed it from ...August 1 to day1, that is still plan-level info stored in the header
day1: [Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Hack: Syllabus for Research Methods (all day)
day2: Workshop: R Markdown (1:30-3:30p), Re-hack: Diversity and inclusion (morning), All Conference Social Event (Dinner+drinks; 7:30p-??)
day3: NA
data_origin: plans_raw.csv
participant_ID: 77

Tidy solution: translate the data from wide to long

x <- tidyr::gather(x, event_day, event, -participant_ID)
x$event_day <- as.numeric(gsub('day', '', x$event_day)) # clean up
## Warning: NAs introduced by coercion
participant_ID: 77, event_day: 1
  event: [Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Hack: Syllabus for Research Methods (all day)
participant_ID: 53, event_day: 1
  event: Workshop: Fundamentals of meta-analysis (9:30a-12:30p), Workshop: Using OSF (1:30-3:30p), Workshop: Writing papers to be transparent, reproducible, and fabulous (3:30-5:30p)
  • Side effect: some people are now planning for NA. Why?
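One way to see why, sketched on a toy stand-in for the gathered table (values illustrative): gather() keeps every participant-by-day combination, so days a respondent left blank carry through as NA events.

```r
library(dplyr)

# Toy stand-in for the gathered data (illustrative rows)
x <- data.frame(
  participant_ID = c(77, 77, 53),
  event_day = c(1, 3, 1),
  event = c("Hack: Syllabus for Research Methods (all day)", NA,
            "Workshop: Using OSF (1:30-3:30p)"),
  stringsAsFactors = FALSE
)

# Inspect the "plans for NA": blank day cells carried through
filter(x, is.na(event))
```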

Mess #3: Multiple records in a single row

  • As many plans as there were checkboxes
  • Doing any sort of math on this is annoying

Tidy solution: One row per one planned event

(assuming we fixed the comma issue)

x <- separate_rows(x, event, sep = ', ')

We can now easily fix the mis-dating of Brian’s social, which, as an artifact of survey design, would otherwise still be considered a Day 1 event:

# With just one mis-dated event, we can just re-date it
x$event_day[x$event == paste0("[Saturday, July 29th, 7p] ", 
  "Dinner/drinks party at Brian Nosek's home (bus ",
  "transportation to/from Omni hotel all evening)")] <- 0

Mess #4: Multiple variables in a single column

participant_ID: 50
event_day: 1
event: Workshop: Fundamentals of meta-analysis (9:30a-12:30p)
  • All mashed up in event: event time, event name, and event type
# Structure of `event` is stable (Type: name (time)), so 
# a regular expression can extract it
#
# (Not shown: slight data manipulation to make this work for all events)
x <- extract(x, event, 
             c('event_type', 'event_name', 'event_time'), 
             regex = "(^[[:alnum:]-]+): (.+) \\((.+)\\)$", 
             remove = FALSE)

Mess #5: Multiple types of experimental unit in the same table

  • “I’m flexible” isn’t a plan for a specific event.
  • Question: Should such responses be removed or re-defined? When should curation remove responses?
    • Relevant: What does a plan for NA event mean?
participant_ID: 50
event_day: 1
event: Workshop: Fundamentals of meta-analysis (9:30a-12:30p)
event_type: Workshop
event_name: Fundamentals of meta-analysis
event_time: 9:30a-12:30p
flexible_responses <- c("Everything else will be scheduled during the conference depending on what has legs and group interest", 
                        "I am flexible; if there is a hackathon that needs help, let me know.")

# Removal:
# x <- filter(x, !(event %in% flexible_responses))

# Re-definition
x$event_type[x$event %in% flexible_responses] <- "Flexible"

Mess #6: One type of experimental unit stored in multiple tables

  • Can you imagine examples?
  • If we had a separate survey which also asked identifying info, the people table would probably run into duplicates.

Could we make things even cleaner?

  • Add event_date
  • Convert event_time to event_starttime and event_duration
  • Remove event, as we’ve extracted everything out of it
  • In fact, separate out events.csv into its own table and join by event_id
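The event_time split suggested above could look like this, assuming every time string has the start-end shape shown earlier (toy one-row data; column names illustrative):

```r
library(tidyr)

# Toy slice of the tidied data (values illustrative)
events <- data.frame(
  event_name = "Fundamentals of meta-analysis",
  event_time = "9:30a-12:30p",
  stringsAsFactors = FALSE
)

# Split the range on the hyphen; a duration could then be
# computed from parsed start and end times
events <- separate(events, event_time,
                   into = c("event_starttime", "event_endtime"),
                   sep = "-")
```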

Real artists ship publish

x <- select(x, -event)
write_csv(x, 'data/clean/event_plans.csv')
write_csv(people, 'data/clean/people.csv')
participant_ID: 50
event_day: 1
event_type: Workshop
event_name: Fundamentals of meta-analysis
event_time: 9:30a-12:30p

Things are easier with a tidy data set

Tidy data are like toothpaste

  • A tidy dataset is easy to push into “messy” forms, but the reverse doesn’t work quite as well.
  • Quick exploration!
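For example, spread() squeezes a tidy table back into a wide layout in one call, while automating the reverse for arbitrary messy input is much harder (toy data; names illustrative):

```r
library(tidyr)

# Toy tidy table: one row per planned event
tidy <- data.frame(
  participant_ID = c(50, 50),
  event_day = c(1, 2),
  event_name = c("Using OSF", "Fundamentals of R"),
  stringsAsFactors = FALSE
)

# Tidy -> wide ("messy") is trivial: one column per day again
wide <- spread(tidy, event_day, event_name)
```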

Question 1

Question 2

Question 3

Tidy tools

It’s not just R

  • Google Refine
  • Data Wrangler
  • Excel “Table” + pivot table shenanigans
  • All statistical software has some version of reshaping tools

…but if you’re looking for R, tidyverse is excellent

Tidy woes?

Metadata

  • Study- and table-level documentation
  • How the data was collected & why

The time to document metadata is always

Option 1: Standardized metadata

Pros

  • Very, very thoroughly standardized & documented
  • XML, so machine-readable
  • Connects into other existing frameworks
  • Ideal for extensive datasets

Cons

  • Far too thoroughly standardized - steep learning curve
  • XML, so not super human-friendly
  • Overkill for small-ish datasets

Tools

Option 2: “Readme” metadata

Documenting our SIPS dataset

Basics

  1. Title of Dataset: Preliminary plans of SIPS 2017 attendees
  2. Author Information
    • Principal Investigator Contact Information
    • Associate or Co-investigator Contact Information
  3. Date of data collection (single date, range, approximate date): 20170713-20170725
  4. Geographic location of data collection (where was data collected?): online
  5. Information about funding sources that supported the collection of the data: COS / its funders

Basics (cont’d)

  1. Licenses/restrictions placed on the data: MIT
  2. Links to publications that cite or use the data: -
  3. Links to other publicly accessible locations of the data: -
  4. Links/relationships to ancillary data sets: SIPS 2017 registrations
  5. Was data derived from another source? -
  6. Recommended citation for the data: forthcoming

Data and file overview

  1. File List:
     A. Filename: people.csv
        Short description: Properties of SIPS attendees
     B. Filename: event_plans.csv
        Short description: Plans for events by participant, with description of events
  2. Relationship between files: people.csv provides participant data for event_plans.csv (one-to-many)
  3. Additional related data collected that was not included in the current data package: flexible plans, participant registration
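The one-to-many relationship can be realized with a join on participant_ID (toy rows standing in for the two CSVs):

```r
library(dplyr)

# Toy stand-ins for the two published files
people <- data.frame(
  participant_ID = c(8, 77),
  email_server = c("xosz.tdi", "kiwabmx.bzq"),
  stringsAsFactors = FALSE
)
event_plans <- data.frame(
  participant_ID = c(77, 77, 8),
  event_name = c("Using OSF", "R Markdown", "Fundamentals of R"),
  stringsAsFactors = FALSE
)

# Each person row matches one or more plan rows
joined <- left_join(event_plans, people, by = "participant_ID")
```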

Data and file overview (cont’d)

  1. Are there multiple versions of the dataset? no
     If yes, list versions:
     Name of file that was updated:
     i. Why was the file updated?
     ii. When was the file updated?

Methodological information

  1. Description of methods used for collection/generation of data (include references): Google Form sent to registered participants via e-mail
  2. Methods for processing the data: Export to CSV, then running tidying_sips_plans.Rmd
  3. Instrument- or software-specific information needed to interpret the data: none, but R 3.4+ preferred
  4. Standards and calibration information, if appropriate: -

Methodological information (cont’d)

  1. Environmental/experimental conditions: -
  2. Describe any quality-assurance procedures performed on the data: -
  3. People involved with sample collection, processing, analysis and/or submission: Katie & Brian & … (collection), Simon (processing & analysis), David (submission)

event_plans.csv

  1. Number of variables: 5
  2. Number of cases/rows: 1260
  3. Variable List:
     A. Name: participant_ID
        Description: unique participant index linked to the person data in people.csv
     B. Name: event_type
        Description: The format of the event the participant is planning to attend
        Possible values: Social, Hack, Re-hack, Workshop, Flexible, NA (and meanings)
     C. Name:

event_plans.csv (cont’d)

  1. Missing data codes: -
  2. Specialized formats of other abbreviations used: -

people.csv

Resources